The emergence of 1-bit large language models (LLMs) has sparked significant interest, promising substantial efficiency gains through extreme quantization. However, these benefits are inherently limited by the portion of the model that can be quantized. Specifically, 1-bit quantization typically targets only the projection layers, while the attention mechanisms remain in higher precision, creating a potential throughput bottleneck. To address this, we present an adaptation of Amdahl's Law tailored to LLMs, offering a quantitative framework for understanding the throughput limits of extreme quantization. Our analysis reveals that improvements in quantization deliver substantial throughput gains only to the extent that they address the throughput-constrained sections of the model. Through extensive experiments across diverse model architectures and hardware platforms, we highlight key trade-offs and performance ceilings, providing a roadmap for future research aimed at maximizing LLM throughput through more holistic quantization strategies.
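The Amdahl's-Law framing in this abstract can be sketched numerically. The function below is the generic form of the law applied to the quantizable fraction of inference time; the fractions and speedup factors used here are illustrative placeholders, not the paper's measured values:

```python
# Amdahl's Law applied to LLM quantization (illustrative sketch).
# If a fraction p of inference time is spent in the quantizable projection
# layers, and 1-bit quantization speeds those layers up by a factor s, then:
#   speedup = 1 / ((1 - p) + p / s)

def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the work is accelerated by factor s."""
    return 1.0 / ((1.0 - p) + p / s)

# Even with an unbounded speedup on the quantized part, overall throughput
# is capped by 1 / (1 - p), i.e. by the unquantized attention layers.
for p in (0.5, 0.8, 0.95):
    print(f"p={p}: s=10 -> {amdahl_speedup(p, 10):.2f}x, ceiling -> {1 / (1 - p):.1f}x")
```

This makes the abstract's point concrete: at p = 0.5, no amount of projection-layer quantization can exceed a 2x end-to-end gain.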
-
This paper investigates the combined potential of neuromorphic and edge computing to develop a flexible machine learning (ML) system designed for processing data from dynamic vision sensors. We build and train hybrid models that integrate spiking neural networks (SNNs) and artificial neural networks (ANNs) using the PyTorch and Lava frameworks. We explore the effects of quantization on ANN models to assess their impact on both accuracy and energy efficiency. Additionally, we address the challenges of deploying hybrid models on hardware by implementing individual components on specific edge platforms. We also propose an accumulator circuit to bridge the spiking and non-spiking domains. Comprehensive performance analyses are conducted on a heterogeneous system of neuromorphic and edge AI hardware, assessing accuracy, latency, and energy consumption. Our results show that hybrid spiking networks improve accuracy and energy efficiency. Moreover, we find that quantization benefits hybrid networks, further reducing energy consumption while boosting accuracy.
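The ANN quantization explored above can be illustrated with a minimal sketch. The symmetric per-tensor int8 scheme below is an assumption for illustration; the paper's actual quantization scheme may differ:

```python
import numpy as np

# Minimal sketch of post-training symmetric int8 quantization (an assumed
# scheme, not necessarily the one used in the paper). Weights are mapped to
# int8 with a single per-tensor scale factor.

def quantize_int8(w: np.ndarray):
    """Quantize a float tensor to int8 with a symmetric per-tensor scale."""
    scale = np.max(np.abs(w)) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor from int8 values and the scale."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.2, 0.03, 0.9], dtype=np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
# Per-element reconstruction error is bounded by half a quantization step (scale / 2).
```

Storing `q` instead of `w` cuts weight memory 4x versus float32, which is the kind of saving that translates into the energy reductions reported here.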
-
Peng, Lu; Vaisband, Boris; Chen, Fan; Zhou, Peipei; Kvatinsky, Shahar; Xie, Jiafeng (Eds.)
In this paper, we propose the CrossNAS framework, an automated approach for exploring a vast, multidimensional search space that spans various design abstraction layers—circuits, architecture, and systems—to optimize the deployment of machine learning workloads on analog processing-in-memory (PIM) systems. CrossNAS leverages the single-path one-shot weight-sharing strategy combined with evolutionary search for the first time in the context of PIM system mapping and optimization. CrossNAS sets a new benchmark for PIM neural architecture search (NAS), outperforming previous methods in both accuracy and energy efficiency while maintaining comparable or shorter search times.
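The evolutionary search at the core of CrossNAS can be sketched in miniature. The search space and fitness function below are hypothetical stand-ins; the actual CrossNAS objective scores candidates by accuracy and energy efficiency under a one-shot weight-sharing supernet:

```python
import random

# Toy evolutionary search over a discrete architecture space (illustrative
# only; the search space and fitness here are stand-ins, not CrossNAS's).
SEARCH_SPACE = [(k, w) for k in (3, 5, 7) for w in (16, 32, 64)]  # (kernel, width)

def fitness(arch) -> float:
    # Stand-in objective; CrossNAS would evaluate accuracy/energy on the supernet.
    k, w = arch
    return -abs(k - 5) - abs(w - 32) / 16

def evolve(pop_size: int = 8, generations: int = 20, seed: int = 0):
    rng = random.Random(seed)
    pop = [rng.choice(SEARCH_SPACE) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # keep the fittest half
        children = [rng.choice(SEARCH_SPACE)    # "mutate" by resampling
                    for _ in parents]
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
```

Because candidates share weights in a one-shot supernet, each fitness evaluation is cheap, which is what makes this loop tractable for NAS.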
-
In this paper, we present a novel hybrid computing architecture designed to accelerate inference in 1-bit large language models (LLMs). Our approach combines the strengths of analog in-memory computing (IMC) and digital systolic arrays to address the diverse precision requirements across different layers of 1-bit LLMs. Specifically, we utilize analog IMC to accelerate low-precision matrix multiplication (MatMul) operations within the projection layers, which are naturally amenable to extreme quantization. Meanwhile, digital systolic arrays are employed to efficiently handle high-precision MatMul operations in the attention heads, preserving accuracy where precision is most critical. By partitioning the computational workload based on precision needs, our hybrid architecture increases throughput and energy efficiency. Experimental evaluations demonstrate that our design delivers up to an 80x improvement in tokens processed per second and achieves a 70% increase in energy efficiency (tokens per joule) when compared to conventional digital hardware accelerators.
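The precision-based workload partitioning described above can be modeled with a simple back-of-envelope latency calculation. All rates and work fractions below are assumed placeholder numbers, not measurements from the paper:

```python
# Back-of-envelope model of the hybrid pipeline (assumed numbers, for
# illustration only): each token's MatMul work is split between analog IMC
# (projection layers) and digital systolic arrays (attention heads).

def hybrid_tokens_per_sec(work_proj: float, work_attn: float,
                          rate_imc: float, rate_digital: float) -> float:
    """Tokens/s when the two partitions execute sequentially per token."""
    time_per_token = work_proj / rate_imc + work_attn / rate_digital
    return 1.0 / time_per_token

# Hypothetical split: 80% of the work in projections, 20% in attention,
# with IMC running 10x faster than the digital arrays on its partition.
tps = hybrid_tokens_per_sec(work_proj=0.8, work_attn=0.2,
                            rate_imc=100.0, rate_digital=10.0)
```

The same Amdahl-style structure appears here as in the quantization analysis: even a very fast analog partition leaves overall throughput bounded by the digital attention path.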